Forces from Stochastic Density Functional Theory under Nonorthogonal Atom-Centered Basis Sets

We develop a formalism for calculating forces on the nuclei within the linear-scaling stochastic density functional theory (sDFT) in a nonorthogonal atom-centered basis set representation (Fabian et al. Wiley Interdiscip. Rev.: Comput. Mol. Sci.2019, 9, e1412, 10.1002/wcms.1412) and apply it to the Tryptophan Zipper 2 (Trp-zip2) peptide solvated in water. We use an embedded-fragment approach to reduce the statistical errors (fluctuation and systematic bias), where the entire peptide is the main fragment and the remaining 425 water molecules are grouped into small fragments. We analyze the magnitude of the statistical errors in the forces and find that the systematic bias is of the order of 0.065 eV/Å (∼1.2 × 10–3Eh/a0) when 120 stochastic orbitals are used, independently of system size. This magnitude of bias is sufficiently small to ensure that the bond lengths estimated by stochastic DFT (within a Langevin molecular dynamics simulation) will deviate by less than 1% from those predicted by a deterministic calculation.


Fritz Haber Center for Molecular Dynamics and Institute of Chemistry,
The Hebrew University of Jerusalem, Jerusalem 9190401, Israel

S1. EVALUATING MATRIX ELEMENTS FOR FORCES
In this section we will describe how to calculate (δ C S) αβ = ∂ ∂X C S αβ δ C X: used in Eq. 11 in the manuscript.

A. The Pulay forces
Here we give the detail of calculating the ∂ ∂X C φ α φ β and ∂ ∂X C φ α ĥ KS φ β type matrix elements. For these we need to calculate the derivatives of the basis functions φ (r g − R C ) with respect to nuclear coordinates R C = (X C , Y C , Z C ). Since we use Cartesian Gaussian functions as our basis we enjoy the ease of calculating their values and derivatives analytically on the grid. Each atom-centered basis function is a sum of primitives, e −γx 2 x l ×e −γy 2 y m × e −γx 2 z n , where n+l+m is the total angular momentum quantum number. Therefore each primitive can be represented by three 1-dimensional vectors ξ α (x g − X C ), η α (y g − Y C ) and ζ α (z g − Z C ) and the function is defined inside a "window" surrounding the atom C and given as a product of three terms at each of its grid points:

S2
The grid is then used to evaluate the matrix elements of the overlap and Hamiltonian matrices: where h is the uniform grid-spacing.
Since the forces are the derivatives with respect to nuclear coordinate X C , while our grid points are the electronic coordinates, x g , we use the fact that for a basis function centered around R C , taking the derivative with respect to X C is simply the negative of the derivative with respect to x evaluated at r g − R C : such that derivatives can therefore also be defined inside a "window" surrounding the atom C and given as a product of three terms at each of its grid points. The grid is then used to evaluate the Pulay matrix elements of δ C S and δ C H: The above Pulay terms are non-zero only when the φ α basis belongs to atom C, i.e. when φ α ∈ C, and therefore need to be calculated only for a very small number of α, β pairs, i.e. only when φ α , φ β have overlapping windows on the grid. The result of these conditions is that Pulay force matrices are incredibly sparse and their number of non-zero elements, N non zero , is independent of the system size, as it depends only on the choice of basis set through the number of basis functions per atom. As for each degree of freedom (DOF), X C (X direction of atom C), we have to compute a Pulay force sparse-matrix, we exploit these conditions in our code by only evaluating non-zero matrix elements and storing them in sparse-matrix structures. In Subsection S1 C we describe the method of storage and application (onto a vector) of the sparse structure we have used.
The algorithm for computing ∂ ∂X C S αβ • For all basis functions α ∈ C loop over all basis function β < α that overlap with α, then: • For all basis functions α / ∈ C loop over all basis function β < α that overlap with α, then: • All other terms are not 0, and we do not store them in the sparse matrix structure.
(To simplify the code, as φ α ∂ ∂X C φ β = ∂ ∂X C φ β φ α , we always take the derivative from the left side). Overall we get: and the same can be done for the Hamiltonian Pulay terms: however, as the Hamiltonian includes terms that are explicitly dependent on nuclear coordinates in the form of the non-local (nl ) and local (loc) pseudopotential terms:v nl/loc pp nl/loc pp φ β type terms that contribute to the overall force on atom C in the X direction. See Subsection S1 B below for the detail of these force terms.

B. The direct forces
Here we give the detail of calculating the φ α ∂ ∂X Cv nl/loc pp φ β type matrix elements.
Since the non-local and local pseudopotential operators arev nl/loc pp = C ∈nucleiv nl/loc pp(C ) , the derivative with respect to a variation in nuclear coordinate X C is given by: Thev nl/loc pp(C) operators have an analytical expression of the Kleinman-Bylander form [1], such that we can apply it, and its derivative, ∂ ∂X Cv nl/loc pp(C) , on a vector on the grid. Due to its short-range nature,v nl pp(C) and subsequently ∂ ∂X Cv nl pp(C) are stored on a small "window" of grid points around R C . The φ α ∂ ∂X Cv nl pp(C) φ β matrix elements are therefore calculated as a multiplication of two grid vectors: where the sum over grid points r g is over only grid points that are inside the windows of all three terms, φ α , φ β and v nl pp(C) . This requirement of overlapping windows for all three terms results in very sparse matrices, with the number of non-zero matrix elements, N non zeros , dependent only on the choice of basis set and independent of system size.
For each degree of freedom (DOF), X C (atom C in the X direction), we have to compute a ∂ ∂X C V nl pp(C) force sparsematrix and apply it to a stochastic vector as part of the evaluation using the stochastic trace formula. In Subsection S1 C we describe the method of storage and application of the sparse structure we have used.

S4
The algorithm for computing • loop over all basis functions β that have overlapping windows withv nl pp(C) -Calculate the ∂ ∂X Cv nl pp(C) φ β (r g ) grid vector for all r g ∈ β ∩v nl pp(C) * loop over all basis functions α that overlap with both β,v nl The local PP operator,v loc pp , (which includes the long range Coulomb attraction), depends directly on the density n (r g ). In this case, an equivalent method to the trace calculation of the direct ∂ ∂X Cv loc pp is done by usual planewaves calculations on the grid using n (r g ) in reciprocal space. This scaling of this approach is quadratic with system size when the forces on all atoms are calculated, however it is highly efficient.

C. Sparse matrix structures for Pulay and non-local PP forces
To exploit the sparsity of the above, direct and Pulay, force matrices, we do not store them in full K × K matrix structures (where K is the number of basis functions), but rather use compact a storage structure. In these, for every degree of freedom we store three lists of length N non zeros : (val,I,J) ≡ M ij where (I,J) gives the location of the value, val, in the K × K matrix M .
As per the above example, the sparse structure allows for a significant reduction in the number of stored values in memory as we only store (3 × N non zeros ) elements per DOF as opposed to K 2 elements per DOF. Since the number of DOF's, N DOF ∝ K and since N non zeros is a small number dependent only on the choice of basis set (but not on system size!) our sparse structure reduces the scaling of the memory requirement from O K 3 to O (K) . The sparse structure also allows for an efficient matrix vector multiplication. For a matrix M and vector |z the operation |y = M |z is given by: such that we only require N non zeros multiplications for a matrix vector operation (as all terms of c, that are not in the I list, are zero).
For the stochastic trace formula we need to calculate expectation values using the stochastic vectors, χ. In a bra-ket notation, for a matrix M : such that we only require 2×N non zeros multiplications. Since the calculation of the force operators' expectation values (direct and Pulay), per DOF, are independent of system size, the overall scaling of the force calculations using this sparse structure is O (K).

A. Random variables
In order to understand the statistical errors involved in our procedures we briefly review the concept of a random variable r [2]. It takes any one of a discrete set of values {r} with a given probability p (r) ≥ 0, where r p (r) = 1.
both can also be viewed as random variables with appropriate probability functions themselves. It can be shown that 1. The expected value of m I is the same as that of r: 2. The variance of m I is smaller by a factor I than that of r: 3. The expected value of s 2 I is equal to the variance of: is the uncertainty giving a 70% confidence interval for E [r]. Based on the sampled data, there is a probability of 70% that E [r] falls within this interval.

B. Stochastic vectors
In section 2.4 of the main text we discuss the stochastic evaluation of observables using the stochastic trace formula where we treat each χ T Aχ as a random variable. The variance associated the result is given by The relation in Eq. (S2) is called the stochastic trace formula and it allows evaluating the trace of A by parameter estimation techniques based on statistical sampling theory.

S6
Proof of Eq. (S3) We begin the proof using the definition of the variance of a random variable, 2 , and considering χ T Aχ as our random variable: where in the second line we have used Eq. (S2), and in the third line the fact that the matrix A is completely deterministic. We will now evaluate E [χ k χ l χ i χ j ], using that E [χ i χ j ] = δ ij : and multiply by A kl A ij and sum over all indices: Substituting back into Eq. (S4), we arrive at

C. Parameter estimation and statistical errors
Often expected value E [r] of a distribution of a random variable r is not known. The estimation of this parameter can be done, based on the use of a finite sample r i of size I, as discussed in section S2 A. As an demonstration of this procedure we return to the question of how to evaluate the Tr [A]. We take a sample of I stochastic vectors χ i and form a random variable 1 and As discussed above, the sample mean m I = 1 involves K such applications. Therefore, as long as I K we obtain a large saving in the numerical effort, but at the price of introducing an uncertainty. Now, suppose we wanted to estimate a given function of the expected value of a random variable, f (E [r]) . The simplest procedure is apply f to the sample mean m I and take f (m I ) as such an estimate. This procedure works when f (x) is a linear function of x but otherwise will generally incur a systematic error, called a bias. Hence we have the undesirable case, that for a finite value of I, errors will involve fluctuations around the wrong value. Note however, that as I grows, the bias diminishes in proportion to 1 I . To be useful, when a bias exists, we need to make sure it is of sufficiently small magnitude. Force error eV/Å Figure S1. Histograms representing the distribution in the errors in the sDFT forces with I = 12 stochastic vectors. We show the errors of forces acting on two nitrogen atoms from the solvated-TrpZip2 system. On the left (right) panel is the data for the smallest (largest) standard deviation cases from all nitrogen atoms. The data for the histogram was collected by repeating the sDFT force calculation M = 2800 times and comparing the force vector F m C on each atom C to the corresponding deterministic (accurate) value F dDF T C . Thus we obtained 3 × M error values in the x-y-z components of the force, which represent the scatter of force components for the atom.

D. Distribution of random errors in the forces
In Fig. S1 we present histograms of the sDFT force errors (F m C − F dDF T C , m = 1, . . . , M ) on two nitrogen atoms in the solvated-TrpZip2 system. We selected the atoms that have the smallest/largest standard deviation. For both atoms, we find a Gaussian-looking distribution of the force errors centered around zero with a standard deviation within the bounds reported in the Manuscript.
is a stochastic correction to the dDFT calculation summed over all fragments. As per Eq. (S3) the variance in the stochastic trace is given by the magnitude of the off-diagonal matrix elements of O P − f P f , and so clearly, as f P f approaches P the stochastic trace will have much smaller variance.

S4. SYSTEM SIZE DEPENDENCY
We compare the statistical errors of sDFT forces of two Trp-zip2 peptide systems, each with a different number of solvating water molecules. The first is the system we presented results for in Section III of the manuscript, solvated by 425 water molecules, while the second is a smaller system, with only 195 solvating water molecules. To keep the systems as similar as possible we cropped the smaller system directly from the larger one, such that the 195 solvating water molecules that are closest to the peptide are identical in their geometry in both systems, and the embeddedfragment method was employed such that, for both systems, the entire peptide composes one fragment, while the remaining water molecules are split into fragments with an average number of 16 water molecules per fragment.
In Fig. S2 we present the uncertainty and bias estimates for the sDFT forces on the 20 nitrogen atoms of the peptide, in the two systems described above, for the case of I = 12 stochastic orbitals. To allow for a comparison of the bias estimates, we repeated the calculation a large, M = 2800, times. The uncertainty values of each nitrogen atom are near identical between the two systems. Since the errors fluctuate, even for 2800 repetitions, we compare the median value over all nitrogen atoms, and find that it is practically identical for the two systems. These results give indication to a weak dependency of the statistical errors, uncertainty and bias, on system size.

S5. COMPARISON WITH REAL-SPACE GRID REPRESENTATION
We have checked the speed of basis sets vs. real-space grid calculations (as described in Ref. [3]) with Si 35 H 36 as a benchmark. For the real-space grid calculations we used a grid of 64 3 points of spacing δx = 0.5a 0 and the wall time for the real-space sDFT/SCF iteration was one minute. For the basis-set sDFT calculation we used the same integration grid.   Table S1. Data for comparing speeds and wall times for real-space and basis-set calculations Here are the results of our analysis (summarized in Table S1): the Hamiltonian operation for STO-3G and 6-31G are a factor 40 and 4 respectively faster than that of the real-space grid. Furthermore, the energy range of the basis sets is much smaller, so the Chebyshev expansion length of STO-3G and 6-31G are a factor 13 and 8 shorter than that of the real space grid. As a result, the operation of the DM on a vector in STO-3G and 6-31G are a factor of about 500 and 30 faster respectively than in the real-space grid calculation. Therefore, the calculation of a single SCF iteration is much faster in the basis set calculations than in the real-space grid.
Next, we considered the fluctuation error in both calculations. Surprisingly, we found, that the standard deviation of a typical force component on Si, for the STO-3G and the 6-31G basis sets are larger by a factor 1.4 and 4.6 respectively than that for the real-space grid. This means that for a given standard deviation goal, the basis set calculations will require more CPUs by factors of about N p = 1.4 2 ≈ 2 and N p = 4.6 2 ≈ 21, respectively, than the grid calculation.
Summarizing, STO-3G calculations can be very fast relative to the real-space grid's, however, it is well known that STO-3G is not accurate enough to be a reliably useful basis. When run with the same number of processors, the overall numerical effort of the 6-31G basis is comparable to that of the real-space grid calculation. However, if one can allocate a factor of N p ≈ 21 more processors to the 6-31G calculation than needed by the real-space grid, the